- Why go Bayesian?
- Introduction to Bayesian analysis
- Bayesian estimation (getting the posterior distribution)
- MCMC sampling
- How good are my samples?
- Posterior predictive checks
- Model comparison
- Summarizing the posterior
Last updated: 2020-02-21
It tells you what you want to know!
Given the data (and our prior knowledge), what is the probability distribution of the parameters?
\[ p(\theta \mid y) = \frac{p(\theta)p(y \mid \theta)}{p(y)} \]
\(p(y)\) does not depend on the model parameters, so we can omit it in favor of the unnormalized posterior:
\[ p(\theta \mid y) \propto p(\theta)p(y \mid \theta) \]
The posterior is proportional to the prior times the likelihood
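As a sketch, the unnormalized posterior can be evaluated on a grid for a one-parameter problem (the data `Y` and the Normal(0, 10) prior below are hypothetical stand-ins, with the sd fixed at 10 for simplicity):

```r
# Grid approximation of the posterior for mu (sd fixed at 10)
# Y and the prior mu ~ Normal(0, 10) are hypothetical stand-ins
set.seed(9)
Y <- rnorm(50, mean = 5, sd = 10)
mu_grid <- seq(-10, 20, length.out = 501)
unnorm <- sapply(mu_grid, function(m) dnorm(m, 0, 10) * prod(dnorm(Y, m, 10)))
post <- unnorm / sum(unnorm)   # normalizing over the grid
mu_grid[which.max(post)]       # posterior mode, close to the sample mean
```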
\[ L(\theta \mid \mathbf{y}) = \prod_{i=1}^{n} f(y_i \mid \theta) \]
In R:
```r
# likelihood of data for mu = 5 and sd = 10
prod(dnorm(x = Y, mean = 5, sd = 10))
## [1] 1.820996e-79

# equivalently, via the log likelihood (avoids underflow):
exp(sum(dnorm(Y, 5, 10, log = TRUE)))
```
We can use `optim` to search for the parameters that maximize the above (or minimize the negative log likelihood).

"Weakly informative prior should contain enough information to regularize: the idea is that the prior rules out unreasonable parameter values but is not so strong as to rule out values that might make sense"
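The `optim` search can be sketched as follows (the data `Y` below are hypothetical, standing in for the reading-time example; `sigma` is parameterized on the log scale to keep it positive):

```r
# Sketch: maximum likelihood via optim, minimizing the negative log likelihood
set.seed(1)
Y <- rnorm(100, mean = 5, sd = 10)   # hypothetical stand-in data
nll <- function(par) {
  # par[1] = mu, par[2] = log(sigma); log scale keeps sigma positive
  -sum(dnorm(Y, mean = par[1], sd = exp(par[2]), log = TRUE))
}
fit <- optim(par = c(0, 0), fn = nll)
c(mu = fit$par[1], sigma = exp(fit$par[2]))  # close to the sample mean and sd
```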
Going back to our reading time example, say we have a fairly good idea of the reading rate (words per second) of the general population:
For the standard deviation of reading times (\(\sigma\)), we might be less certain:
We can specify the priors separately
But they determine a joint distribution
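A quick way to see the implied joint distribution is to draw from each prior independently and plot the pairs (the Normal(5, 2) and half-Normal priors below are hypothetical stand-ins for the ones above):

```r
# Independent draws from each marginal prior...
mu    <- rnorm(1e4, mean = 5, sd = 2)        # prior on the mean
sigma <- abs(rnorm(1e4, mean = 0, sd = 10))  # half-normal prior on the sd
# ...each pair (mu[i], sigma[i]) is one draw from the joint prior
plot(mu, sigma, pch = ".")
```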
Another example, linear regression:
\[ y_i \sim \mbox{Normal}(\beta_0 + \beta_1x_i, \; \sigma) \]
If \(x\) and \(y\) have been scaled (\(z\)-scored), a reasonable choice would be:
These priors essentially say that we expect either a positive or negative relationship between \(x\) and \(y\).
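One way to see what such priors imply is to plot the regression lines they generate (a prior predictive sketch, assuming \(\beta_0, \beta_1 \sim \mbox{Normal}(0, 1)\) for z-scored data):

```r
# Regression lines implied by the priors beta0, beta1 ~ Normal(0, 1)
set.seed(2)
b0 <- rnorm(50); b1 <- rnorm(50)
x  <- seq(-2, 2, length.out = 100)
plot(NULL, xlim = range(x), ylim = c(-4, 4), xlab = "x", ylab = "y")
for (i in seq_along(b0)) lines(x, b0[i] + b1[i] * x, col = grey(0.7))
```

Roughly half the lines slope up and half slope down, matching the "positive or negative relationship" reading of the prior.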
If we had strong reason to expect that \(y\) should increase with \(x\) we could instead use:
brms and Stan use the LKJ prior (after Lewandowski, Kurowicka, & Joe, 2009)
With lme4, convergence can be an issue - often the model is too complex to be supported by the data

Markov Chain Monte Carlo
Goal of MCMC - to approximate a target distribution
We'll talk about 1 and 5; slides on 2, 3, and 4 are at the end.
Note: Warm up for brms/Stan is more complicated and serves to tune the sampling parameters
How do we know that we have converged onto a stable distribution?
For each step in the chain (or a random subset of the chain), simulate \(N\) observations from the model with parameters set to the current step in the chain (where \(N\) is the size of the original data set)
We can compare these simulated outcomes to the observed data
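This procedure can be sketched for a simple normal model; the posterior `draws` below are fabricated for illustration (in practice they come from the sampler):

```r
# Fabricated posterior draws for mu and sigma (stand-ins for real samples)
set.seed(3)
draws <- cbind(mu = rnorm(1000, 5, 0.5), sigma = abs(rnorm(1000, 10, 1)))
N <- 100  # size of the original data set
# One simulated data set of size N per posterior draw
yrep <- apply(draws, 1, function(d) rnorm(N, mean = d["mu"], sd = d["sigma"]))
dim(yrep)  # N rows, one column per draw
```

Summaries of the columns of `yrep` (means, sds, quantiles) can then be compared against the same summaries of the observed data.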
Both try to estimate the expected log predictive density (elpd) for new data
Both can be computed with the loo package (see Vehtari et al., 2017 for details).

Larger elpd is better, but note that LOO and WAIC are often reported on the deviance scale (multiplied by -2), in which case smaller values indicate better fit.
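In practice you would call `loo::loo()` or `loo::waic()` on a fitted model; as a sketch of what WAIC computes, given a pointwise log-likelihood matrix (rows = posterior draws, columns = observations; the matrix below is fabricated):

```r
# Fabricated log-likelihood matrix: 1000 draws x 50 observations
set.seed(4)
log_lik <- matrix(rnorm(1000 * 50, mean = -2, sd = 0.1), nrow = 1000)
lppd   <- sum(log(colMeans(exp(log_lik))))  # log pointwise predictive density
p_waic <- sum(apply(log_lik, 2, var))       # effective number of parameters
waic   <- -2 * (lppd - p_waic)              # deviance scale: smaller is better
```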
Where LOO is problematic (as in our brms examples), \(K\)-fold cross validation can be used.

\[ \frac{p(M_1 \mid y)}{p(M_2 \mid y)} = \frac{p(y \mid M_1)}{p(y \mid M_2)} \times \frac{p(M_1)}{p(M_2)} \]
\[ \frac{p(M_1 \mid y)}{p(M_2 \mid y)} = BF_{1,2} \times \frac{p(M_1)}{p(M_2)} \]
\[ p(y \mid M) = \int p(y \mid \theta, M) p(\theta \mid M) d\theta \]
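This integral can be approximated by averaging the likelihood over draws from the prior (a simple but high-variance Monte Carlo sketch; the model and data below are hypothetical):

```r
# Hypothetical data and model: y ~ Normal(mu, 1), prior mu ~ Normal(0, 10)
set.seed(8)
y <- rnorm(20, mean = 5, sd = 1)
mu_prior <- rnorm(5000, mean = 0, sd = 10)  # draws from p(theta | M)
lik <- sapply(mu_prior, function(m) exp(sum(dnorm(y, m, 1, log = TRUE))))
mean(lik)  # Monte Carlo estimate of p(y | M)
```

Doing this for two models and taking the ratio of the estimates gives a (crude) Bayes factor.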
BayesFactor package (only normal models)

You will see both of these around…
- Credible intervals (reported by brms)
- Highest Density Interval (use `HDInterval::hdi()` to calculate)
- Region Of Practical Equivalence (ROPE)
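`HDInterval::hdi()` does the work in practice; as a sketch of the idea (the shortest interval containing 95% of the posterior draws):

```r
# Minimal HDI: slide a window over the sorted draws, keep the narrowest
hdi_95 <- function(draws, mass = 0.95) {
  s <- sort(draws)
  n <- ceiling(mass * length(s))          # points the interval must contain
  widths <- s[n:length(s)] - s[1:(length(s) - n + 1)]
  i <- which.min(widths)                  # narrowest window wins
  c(lower = s[i], upper = s[i + n - 1])
}
set.seed(5)
hdi_95(rnorm(1e4))  # roughly (-1.96, 1.96) for a standard normal
```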
Steps we'll follow in our brms examples:
- Fit the model (with brms, Stan, etc.) and ensure chains have converged (we'll cover other possible warnings/errors specific to Stan)

Further reading:
The next slides show examples of simulated data for the linear regression example (\(y_i \sim \mbox{Normal}(\beta_0 + \beta_1x_i, \sigma)\)), with 100 observations at \(x = 0\) and 100 at \(x = 1\)
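Such data can be simulated directly (the "true" parameter values below are arbitrary choices for illustration):

```r
# 100 observations at x = 0 and 100 at x = 1
set.seed(6)
x <- rep(c(0, 1), each = 100)
beta0 <- 2; beta1 <- 0.5; sigma <- 1    # arbitrary "true" values
y <- rnorm(length(x), mean = beta0 + beta1 * x, sd = sigma)
```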
A way of estimating the number of independent samples once accounting for autocorrelation:
\[ ESS = \frac{N}{1 + 2\sum_{k = 1}^{\infty} \rho_k} \]
Where \(\rho_k\) is the auto-correlation at lag \(k\). Think of this as dividing the number of samples by the amount of auto-correlation. In practice the sum stops when the auto-correlation is small (e.g. \(\rho_k < 0.05\); see Kruschke, 2015, p. 184).
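The formula can be sketched in base R using `acf()`, truncating the sum at the first small autocorrelation as described above:

```r
# ESS = N / (1 + 2 * sum of autocorrelations), truncated when rho_k < 0.05
ess <- function(chain, cutoff = 0.05) {
  rho <- acf(chain, plot = FALSE, lag.max = length(chain) - 1)$acf[-1]
  k <- which(rho < cutoff)[1]           # first lag with small autocorrelation
  if (is.na(k)) k <- length(rho)
  length(chain) / (1 + 2 * sum(rho[seq_len(k - 1)]))
}
set.seed(7)
ess(rnorm(1000))  # near 1000 for independent draws
```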
Discarding every \(k^{\mbox{th}}\) sample (thinning) can reduce autocorrelation (below, \(k = 10\))